Operating System - Persistence

CS537 - Operating System Summary Part 2 Persistence

Persistence

IO Device and Communication Protocol

Motivation

  • we want hardware that lets us plug in different devices
  • we want an OS that can interact with different combinations of hardware

Hardware support for IO devices in the system architecture

  • different buses differ in speed, cost, and the number of devices they can connect
  • high-speed buses are very expensive to manufacture
  • hierarchical buses are a good compromise
  • proprietary memory bus: ~60 GB/s on a 4-core system
  • general I/O bus: PCI, etc., 1-4 GB/s
  • peripheral I/O bus: disk devices, SCSI, SATA, USB, ~100 MB/s
  • Modern hierarchical systems use more specialized chipsets and point-to-point interconnects for better performance. Example: the Z270 chipset:
  • Dedicated graphics bus: supports graphics-intensive applications such as gaming, interactive web browsing, and photo manipulation.
  • Higher-performance devices connected via PCIe, e.g., NVMe persistent storage.
  • Lower-performance devices connected via USB and eSATA (external SATA); SSDs provide higher-speed storage.

OS’s view of a device & a canonical device

  • Interface: the registers the OS reads/writes, allowing the system to control the device’s operation
  • Internal structure (varies by device and manufacturer): microcontroller, extra memory, special-purpose chips… connections to cache/disk drives, graphics cards…

A canonical protocol: the OS writing to a device

while (STATUS == BUSY)
    ; // spin until device is not busy
Write data to DATA register
Write command to COMMAND register
while (STATUS == BUSY)
    ; // spin until device has finished the command
  • The simple polling protocol works but can be inefficient:
    [Timeline figure: while the disk is busy with process A's sys_write (spinning, copying data and command, then waiting for the command to execute), the CPU polls the whole time instead of running process B.]
  • Polling reduces CPU utilization when request processing times are long.
  • Using interrupts instead:
while (STATUS == BUSY) // 1
    wait for interrupt;
Write data to DATA register // 2
Write command to COMMAND register // 3
while (STATUS == BUSY) // 4
    wait for interrupt;
  • Interrupt improves CPU utilization in this case.
  • Summary: polling vs. interrupts is a tradeoff.
    • Faster devices:
      • better to spin (poll) while waiting than to pay the interrupt overhead
    • Unknown device speed:
      • hybrid approach (spin for a while, then fall back to interrupts)
    • Better not to use interrupts when floods of interrupts may arrive:
      • example: floods of requests hitting the NIC of a web server
      • can lead to livelock (the CPU is always handling interrupts rather than doing actual work, i.e., running the user-level processes that service the requests)
      • better to ignore interrupts and poll, so the OS makes progress handling requests and keeps control of what is happening in the system
    • Other improvements
      • Interrupt coalescing (batch several interrupts into a single one): reduces the overhead of interrupt processing.

Data Transfer Costs & More efficient data movement with DMA

  • Programmed I/O (PIO):
    • PIO is a method of transferring data between the CPU and a peripheral, such as a network adapter or an ATA storage device. Every data item transferred is moved by an instruction in the program, involving the CPU in each transaction.
    • When using PIO to transfer a large chunk of data to a device, the CPU is overburdened with the trivial task of copying data from memory to the device one word at a time. Poor CPU utilization!
    • Solution: Direct Memory Access (DMA)
      • The CPU lets a special-purpose device, the “DMA engine”, copy the data on its behalf.
      • The OS programs the DMA engine by telling it where the data lives in memory, how much data to copy, and which device to send it to.
      • The CPU is thus free, and the OS can do something else: this improves both CPU and disk utilization, and it removes the cost of copying data into the device's data register (not the command register, since commands are usually very small).

while (STATUS == BUSY)              // 1
    ;
Write command to COMMAND register   // no step 2 anymore; 3
while (STATUS == BUSY)              // 4
    ;
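To make the DMA programming step concrete, here is a minimal C sketch of how an OS might set up a DMA engine. The register layout, addresses, and names are hypothetical, invented purely for illustration:

#include <stdint.h>

/* Hypothetical DMA engine registers (addresses invented for illustration). */
#define DMA_SRC ((volatile uint64_t *)0xFED00000) /* source address in memory */
#define DMA_LEN ((volatile uint64_t *)0xFED00008) /* number of bytes to copy */
#define DMA_DEV ((volatile uint64_t *)0xFED00010) /* target device id */
#define DMA_GO  ((volatile uint64_t *)0xFED00018) /* write 1 to start transfer */

void dma_start(void *src, uint64_t len, uint64_t device_id) {
    *DMA_SRC = (uint64_t)src;  /* where the data lives in memory */
    *DMA_LEN = len;            /* how much data to copy */
    *DMA_DEV = device_id;      /* which device to send it to */
    *DMA_GO  = 1;              /* kick off the copy; the engine raises an
                                  interrupt when done, and the CPU is free
                                  to run something else in the meantime */
}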

Methods of Device interactions

  • How does the OS communicate with a device? (It doesn’t matter much; both methods are used.)
    • Special instructions
      • each device is assigned a port
      • in/out instructions (x86) communicate with the device
    • Memory-Mapped I/O
      • hardware maps device registers into the address space
      • ordinary loads/stores to a mapped address are routed by hardware to the device (the same load/store instructions that move data between memory and CPU registers such as eax/ebx)
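A minimal C sketch of both interaction styles; the mapped address and the port number are hypothetical placeholders, not real device assignments:

#include <stdint.h>

/* Memory-mapped I/O: the device register appears at a (hypothetical)
   address, so an ordinary store reaches the device. */
#define DEV_DATA ((volatile uint32_t *)0xFE000000)

void mmio_write(uint32_t value) {
    *DEV_DATA = value; /* plain store, routed to the device by hardware */
}

/* Special instructions (x86): the out instruction writes to a device port. */
static inline void port_write(uint16_t port, uint8_t value) {
    __asm__ volatile ("outb %0, %1" : : "a"(value), "Nd"(port));
}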

Device Drivers

  • Motivation: we want to keep most of the OS device-neutral, hiding the details of device interaction from the OS subsystems.
  • Device driver: at the lowest level, a piece of software in the OS must know in detail how a device works; all specifics of device interaction are encapsulated within it.
  • Significance: writing a device driver for each device lets us abstract the hardware and avoid writing a different OS for every hardware combination.
  • Example: we want one file system that works with SSD, USB, and SATA devices.
  • Problem: there are many devices in a system, and each has its own protocol!
    • Drivers are about 70% of the Linux source code and a major cause of OS crashes.

Hard Disks

Hard Disk Interface and the view presented to the OS/User

  • The abstraction presented to the OS/user:
    [Figure: the disk as a linear array of sectors: sector 0, sector 1, sector 2, …]
  • Disk has a sector-addressable address space
  • Appears as an array of sectors
  • Sectors are typically 512 bytes (a fixed-size unit, much like a page in VM)
  • Main operations: atomic read/write of a particular sector. On a power failure, the drive guarantees that a write to a sector either completed or did not happen at all.
  • Mechanical and slow
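A minimal user-space sketch of this array-of-sectors abstraction, reading a single sector from a raw block device; the device path is only an example, and the program needs permission to open it:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define SECTOR_SIZE 512

int main(void) {
    char buf[SECTOR_SIZE];
    int fd = open("/dev/sdb", O_RDONLY); /* example device path */
    if (fd < 0) { perror("open"); return 1; }

    /* Sector n lives at byte offset n * 512 in the flat array. */
    long long sector = 2;
    if (pread(fd, buf, SECTOR_SIZE, sector * SECTOR_SIZE) != SECTOR_SIZE) {
        perror("pread"); close(fd); return 1;
    }
    printf("first byte of sector %lld: 0x%02x\n", sector, (unsigned char)buf[0]);
    close(fd);
    return 0;
}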

Internals & Performance Measure

  • Platter: a circular, disk-shaped entity coated with a magnetic film on both sides
  • Spindle: connected to a motor that spins the platters.
  • RPM (Rotations Per Minute): the rate at which the platters spin. 10000 RPM = one rotation every 6 ms
  • Surface: one side of a platter. Both sides can be written/read.
  • Tracks: concentric rings of a certain inner and outer radius; a surface is divided into tracks, and tracks are divided into numbered sectors. Each track in the figure has 8 sectors.
  • Cylinder: the stack of tracks at the same radius across all platters. Useful when we want to perform uniform operations on the same track of each surface.
  • The arm seeks to the desired track while the platter rotates; there is one head per surface for reading/writing.
  • Reading/Writing data from disks
    • Rotational delay: the time spent waiting for the platter to rotate until the right sector on the track is under the head. On average, half a rotation (R/2).
    • Seek time: the time spent waiting for the disk arm to be positioned over the right track.
    • Transfer time: the time during which data is actually read from or written to the surface.
    • Overall time to read/write
      • Time_IO = seek + rotation + transfer
      • IO rate (mainly used for comparing drive performance):
        • IO_rate = size to transfer / Time_IO
    • Summary
      • Seek cost (major): a function of cylinder distance, but not purely linear: the arm must accelerate, coast, decelerate, and settle. Settling alone can take 0.5-2 ms; an entire seek often takes 4-10 ms. Average seek ≈ 1/3 of the max seek.
      • Rotation cost (major): depends on rotations per minute (RPM). 7200 RPM is common, 15000 RPM is high end. Average rotational delay = 1/2 of a full rotation.
      • Transfer time: pretty fast; depends on RPM and sector density. 100+ MB/s is a typical maximum transfer rate.
  • IO time calculation example
    • Find the average time for a 4 KB random read on a Cheetah drive (4 ms average seek, 15000 RPM, 125 MB/s transfer).
    • Solution:
      • Tseek = 4 ms
      • Trotate: 15000 rotations / 60000 ms = 0.25 rotations/ms, i.e., 4 ms per rotation, so 2 ms on average (half a rotation)
      • Ttransfer = 4 KB / (125 MB/s) = 4/125 ms ≈ 0.03 ms
      • Time_IO = 4 + 2 + 0.03 ≈ 6 ms
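A small C sketch that reproduces this arithmetic (drive parameters taken from the example above) and also computes the random-read IO rate discussed in the next section:

#include <stdio.h>

int main(void) {
    double seek_ms     = 4.0;                   /* average seek */
    double rpm         = 15000.0;
    double rotate_ms   = (60000.0 / rpm) / 2.0; /* half a rotation = 2 ms */
    double transfer_ms = 4.0 / 125.0;           /* 4 KB at 125 KB per ms */

    double t_io = seek_ms + rotate_ms + transfer_ms;  /* ~6.03 ms */
    double rate = (4.0 / 1024.0) / (t_io / 1000.0);   /* MB per second */
    printf("Time_IO = %.2f ms, random IO rate = %.2f MB/s\n", t_io, rate);
    return 0;
}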

Workload Performance

  • Question: how do the two kinds of workload affect performance?
    • Sequential: reads a large number of sectors consecutively from the disk, without jumping around.
    • Random: issues small (e.g., 4 KB) reads to random locations on the disk.
    • Example:
      • What is the throughput (IO rate) of a sequential workload and a random workload on the Cheetah?
        • Sequential (e.g., 125 MB in one run): 4 ms seek + 2 ms rotate + 1 s transfer = 1.006 s, so effective throughput is nearly the full 125 MB/s.
        • Random: each IO takes ≈ 6 ms, so throughput = 4 KB / 6 ms ≈ 0.66 MB/s, vastly lower than sequential access.
      • Conclusion: when at all possible, transfer data to and from disks sequentially. If sequential access is not possible, at least transfer data in large chunks: the bigger, the better. If I/O is done in small random pieces, performance suffers dramatically.

Some techniques manufacturers use to improve performance of disks

Track Skew (skewed layout):

  • How should sector numbers be laid out so that we can keep reading sequentially across tracks?
  • Goal: a low-overhead, seamless transition from sector 15 to sector 16 when 16 is on the next track.
  • Solution: the track-skew method
    • idea: overlap the seek with rotation
    • the sector numbering on adjacent tracks is offset (skewed), so that by the time the head finishes seeking to the next track, sector 16 is just arriving under it.

Zones

  • Idea: outer tracks enclose more area than inner tracks and can therefore store more data. Sector size is fixed, so we divide non-uniformly: more sectors on the outer tracks to utilize that space.
  • Zoned bit recording: tracks are grouped into zones; all tracks in a zone have the same number of sectors.

Cache inside drive

  • Idea: drives may cache both reads and writes (in addition to the OS cache). The drive cache is small (2 MB-16 MB).
  • Advantages of the drive cache for reads:
    • recently read sectors are stored and can be served straight from the cache.
    • read-ahead: the drive reads the contents of the entire track into the cache, predictively speeding up sequential workloads.
  • Advantages & disadvantages of the drive cache for writes:
    • immediate reporting: the CPU does not need to wait for the write to finish; the drive can acknowledge a write even before it actually reaches the magnetic medium.
    • danger: cached data can be lost on power failure.
  • Other advantage: multiple outstanding requests
    • tagged command queuing: the disk can reorder/schedule requests for better performance.

IO scheduler, scheduling policies and tradeoff

Motivation

Given a stream of I/O requests, in what order should they be served?

  • Example timeline: P1Read___P2Read___P3write__

Goal

  • The OS should dispatch requests in some order to the shared storage device (the disk).

Key Problem

  • Quite different from CPU scheduling: the position of the disk head relative to the request's position matters more than the length of the job.
  • Example:
    • FCFS/FIFO: assume seek+rotate = 10 ms per random request. How long (roughly) does each workload below take? Requests are given as sector numbers:
      • 300001, 700001, 300002, 700002, 300003, 700003 = 60 ms, because every request needs a seek and a rotation
      • 300001, 300002, 300003, 700001, 700002, 700003 = 20 ms, two sequential runs
      • This shows why IO scheduling is important.

Crux

  • we want to implement an algorithm that more closely approximates SJF by taking both seek and rotation into account.

SSTF (Shortest Seek Time First)

  • Strategy: always choose the request that requires the least seek time.
  • Greedy algorithm (picks the local optimum)
  • Implementation in the OS: use the sector number as a substitute; order by nearest sector number first, trying to issue requests that are close together.
  • Disadvantage: starvation!
    • e.g., a steady stream 30001, 30002, … keeps arriving while 70001 starves
    • avoiding starvation:
      • SCAN/Elevator algorithm: sweep back and forth, from one end of the disk to the other, serving requests as the head passes each cylinder; sorts by cylinder number (track order); ignores rotational delay
        • Example: input 101 201 102 301 203; service order 101 201 301 203 102 (first digit = track number); see the sketch after this list.

        • This ensures, for example, that a request on the outermost track does not starve!

        • This is a “best effort” approach on the OS side -> logically like sorting

      • C-SCAN (Circular SCAN algorithm): only sweeps in one direction: 1->2->3, reset, 1->2->3
        • This is more fair than SCAN, because in pure back-and-forth SCAN the middle cylinder (2) is passed more often on average than the peripheral ones (1 and 3).
  • Problem: SCAN and SSTF do not actually adhere as closely to the principle of SJF as they could. In particular, they ignore rotation.
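A minimal sketch of the elevator idea applied to the example above, assuming the first digit of a request number is its track/cylinder. It performs a single ascending sweep, so, unlike the two-pass service order shown earlier, requests sharing a cylinder are served together (order within one cylinder is unspecified):

#include <stdio.h>
#include <stdlib.h>

/* Compare requests by cylinder (first digit of a 3-digit request number). */
static int by_cylinder(const void *a, const void *b) {
    return (*(const int *)a / 100) - (*(const int *)b / 100);
}

int main(void) {
    int queue[] = { 101, 201, 102, 301, 203 };
    int n = sizeof(queue) / sizeof(queue[0]);

    qsort(queue, n, sizeof(int), by_cylinder); /* one ascending sweep */
    for (int i = 0; i < n; i++)
        printf("%d ", queue[i]); /* e.g., 101 102 201 203 301 */
    printf("\n");
    return 0;
}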

SPTF (Shortest Positioning Time First)

  • Example: we get two requests, one for sector 16 and one for sector 8. Sector 16 needs a shorter seek but a longer rotation; sector 8 needs a shorter rotation but a longer seek. If seek time dominates rotational delay on this disk, SSTF-style policies are fine (we want to minimize seek time), so go to 16. Otherwise 8 is the better choice, because minimizing rotational delay gives better performance.

  • This algorithm is complex; for simplicity, many OSes implement only shortest seek time first.

Where should IO scheduler go? OS vs Disk.

  • Disk: knows the disk geometry much better, but the scheduler is fixed in hardware; the firmware and its scheduler can be hard to update.
  • OS: knows which processes are requesting, so it can do weighted or fair sharing across processes. The scheduler software is easy to update, and the OS can keep multiple disk schedulers and switch between them based on the workload pattern.
  • Typical state-of-the-art approach: a simple scheduler on the OS side sorts requests by sector location, and a more complex scheduler on the disk side accounts for seek and rotation time to minimize things further. The OS typically sends multiple requests to the disk, so the disk scheduler can reorder among them.

How busy should we keep the disk?

#include <assert.h>
#include <unistd.h>

// Reads from file descriptor fd 1 KB at a time and processes each chunk.
void reader(int fd) {
    char buf[1024];
    int rv;
    while ((rv = read(fd, buf, sizeof(buf))) != 0) {
        assert(rv > 0);    // read returns -1 on error
        process(buf, rv);  // takes a short time, e.g., 1 ms
    }
}

  • Assume two processes calling read() under C-SCAN, and consider the request pattern: 100 101 200 201 102 103 202. This pattern is possible because there can be a ~1 ms gap between requests 101 and 102, in which the other process's requests slip in. This is very inefficient. Should the OS always dispatch whatever requests are waiting in the queue, or should it wait to see if other requests arrive (violate work conservation and let the disk sit idle briefly so it can make more efficient progress later)?

  • Work conservation (a consideration in the Linux scheduler)

    • non-work-conserving (violating work conservation): the scheduler may wait a while to merge requests or obtain a better sequence of requests
    • work-conserving schedulers always try to do work if there is work to be done (always dispatch a request if the resource is free)
      • sometimes it is better to wait if the system anticipates that another request will arrive
      • example: I/O merging, where the OS coalesces several IO requests into ONE, lowering IO overhead (see the sketch below).
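A minimal sketch of I/O merging, assuming the pending queue is already sorted by sector; the request record is invented for illustration:

#include <stdio.h>

struct req { long sector; int nsectors; }; /* hypothetical request record */

/* Coalesce adjacent requests in a sector-sorted queue, in place.
   Returns the new queue length. */
int merge_requests(struct req *q, int n) {
    if (n == 0) return 0;
    int out = 0;
    for (int i = 1; i < n; i++) {
        if (q[i].sector == q[out].sector + q[out].nsectors)
            q[out].nsectors += q[i].nsectors; /* contiguous: grow the request */
        else
            q[++out] = q[i];                  /* gap: start a new request */
    }
    return out + 1;
}

int main(void) {
    struct req q[] = { {100, 1}, {101, 1}, {102, 1}, {200, 1} };
    int n = merge_requests(q, 4); /* leaves: {100, 3} and {200, 1} */
    for (int i = 0; i < n; i++)
        printf("sector %ld, length %d\n", q[i].sector, q[i].nsectors);
    return 0;
}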

RAID

Motivation

  • Typical scenario (stack): APP on top of the OS file system on top of the storage device; most file systems work with one disk.
  • Sometimes we need many disks, for reasons of:

    • Capacity
    • Performance
    • Reliability
  • Solution 1: JBOD - Just a Bunch Of Disks

    • applications store data across different file systems, e.g., critical data the app decides to replicate
    • downsides: apps need to know about multiple devices and must be rewritten; not easily deployable
  • Solution 2: RAID - Redundant Array of Inexpensive (Independent) Disks

    • abstracts multiple physical disks into one logical disk presented to the OS
    • advantages: transparent to apps, hence deployable; improved capacity, performance, and reliability!

Fault Model of RAID

  • Simple: Fail-stop model
  • Either works correctly or fails entirely
  • System can easily detect which part is not working
  • No silent failures, no corruption, etc.

General strategy of RAID - Mapping

  • Mapping blocks: build a fast, large disk from smaller ones
  • Very similar to VM: go from virtual to physical addresses by looking up the TLB and page table
  • VM mapping is dynamic - the mapping can change, for example when memory is freed and reallocated
  • RAID mapping is static - the translation is a simple static calculation: no lookup (see the sketch below)
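A minimal sketch of that static calculation for simple striping (RAID-0) with a chunk size of one block; the disk count is an assumed parameter:

#include <stdio.h>

#define NUM_DISKS 4 /* assumed array size */

/* Static RAID-0 mapping: a pure calculation, no lookup table. */
void map_block(long logical, int *disk, long *offset) {
    *disk   = (int)(logical % NUM_DISKS); /* which disk holds the block */
    *offset = logical / NUM_DISKS;        /* block offset within that disk */
}

int main(void) {
    for (long b = 0; b < 8; b++) {
        int d; long off;
        map_block(b, &d, &off);
        printf("logical block %ld -> disk %d, offset %ld\n", b, d, off);
    }
    return 0;
}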

General strategy of RAID - Redundancy

  • Add even more disks for reliability
  • More redundancy == More fault tolerance
  • This is a tradeoff
    • more redundancy improves reliability (and sometimes performance)
    • less redundancy (deduplication) improves space efficiency

RAID analysis

  • RAID level: which scheme is used (striping, mirroring, parity, …)
  • Workload: the types of reads/writes issued by the app
  • Metric: capacity, reliability, performance
  • Analysis mode: given a workload and a RAID level, determine/calculate the metric

RAID levels

  1. Striping (RAID-0)
  2. Mirroring (RAID-1)
  3. Parity (RAID-4)
  4. Rotated parity (RAID-5)
    We will not discuss RAID levels 2, 3, and 6 in this class.

Workload

Metrics
